New York City, the most populous city in the United States, is a magnet for tourism. Whether you come to New York for the food, the musical and artistic talent, or the nightlife, choosing where to stay is an important part of the NYC experience! AirBnb, a popular lodging app and website, has been changing the way we think about where to stay in a new place. The app has grown increasingly popular since it was founded in 2008, putting pressure on the hotel industry. The interesting thing about AirBnb is that anyone with space for guests can be a host. As a result, the app is extremely common in big cities like New York, where hosts can rent out an apartment or a room for a substantial nightly rate.

Using a sample of almost 50,000 AirBnb listings across all 5 boroughs in NYC in 2019, we gathered some useful information on the factors that AirBnb hosts consider when determining the price of their listings. This could help you in choosing the most affordable option that also satisfies your needs when traveling to NYC.

As college students who love to travel, price is a huge consideration for us when choosing where to stay in a new city, and we are sure that many of you feel the same way. There is definitely a relationship between price and the location of each AirBnb listing in New York. As you can imagine, the more expensive the AirBnb is, the “better” the neighborhood it tends to be located in. The reverse holds as well: the cheaper the AirBnb, the “worse” its location tends to be.

This is a boxplot depicting the prices of the AirBnb listings in each borough, separated by room type: shared room, private room, or entire home/apartment. For each boxplot, the left edge of the box represents the 25th percentile of price, the middle line of the box represents the median price, and the right edge represents the 75th percentile of price. The dots represent “outliers”: prices that lie more than 1.5 times the box width (the interquartile range) beyond the edges of the box.

Brooklyn and Manhattan have by far the most AirBnb rentals, and they also have the highest mean prices for entire homes/apartments. Out of just over 25,000 entire home/apartment listings, over 13,000 belong to Manhattan, and over 9,500 belong to Brooklyn. Staten Island and the Bronx are low on this list, with 176 and 379, respectively. We can see from the graph that as the number of listings increases, price tends to increase as well. This suggests that demand in Brooklyn and Manhattan is much higher than demand in the other boroughs.

This pattern largely continues for private rooms, though not for shared rooms, where Brooklyn actually has the lowest mean price. It is unsurprising that the average price for a private room and a shared room decreases tremendously from the entire home, as for most of the boroughs, the 75th percentile of the price of shared homes is less than the 25th percentile of the price of private homes. Across all boroughs, shared rooms had an average price of $70.13, private rooms had an average price of $89.78, and entire homes/apartments had an average price over double that: $211.79. This is likely due to the lack of privacy that a shared home brings, since facilities are often shared, and guests will often have to interact with their hosts as they enter and leave the home. For New York City travelers, this is much less likely to be an issue than for travelers to an area where there isn’t as much to do. NYC travelers are often out and about all day anyway, so if you don’t mind seeing someone else, or possibly waiting to use the restroom, sharing a home with your host, or whoever else is occupying the space, can save you a lot of money.

Personally, we had never heard of “shared rooms” before we conducted this analysis, so we aren’t entirely sure what that entails. It does sound like a bit too much privacy is sacrificed with that option, so since the price difference is not too severe, we would recommend the private room in a home as your best bet.

Are you ever looking for the perfect place to stay when one of the listings catches your eye with just one word? Whether you realize it or not, the language used to describe AirBnb listings may have more of an effect than you know. We computed the average and median prices of AirBnb listings in Manhattan that contained the word “cozy” and of listings that contained the word “luxury,” and compared both to the average and median prices of all of the listings in Manhattan.

count  Word                      mean(price)  median(price)
 1857  Cozy                         129.5256            109
  925  Luxury                       324.8249            220
21661  (all Manhattan listings)     196.8758            150

The table above shows that words such as “cozy” and “luxury” are associated with different prices of AirBnb listings. The word “cozy” seems to be associated with less expensive listings, while the word “luxury” seems to be associated with more expensive listings. This could have to do with the connotation that these words have in our minds: “cozy” usually suggests a smaller, lived-in, maybe even slightly cluttered apartment, while “luxury” suggests a spacious, modern place. However, you cannot directly prove that a place is “cozy” or “luxury”! Overall, we are not saying that these words influence the price; rather, the price may influence the words that hosts use to describe their homes. Keep this observation in mind the next time you are browsing through listings.

So, the question now becomes, what makes up the most expensive AirBnb listings? Well, we narrowed down the top 40 most expensive listings into a graph, so that we can see if specific location makes a difference in price. The price for these listings ranges from $3,600 to $8,000 per night! The blue points represent full homes/apartments, while grey points represent a private room in an apartment.

We can conclude that Manhattan and Brooklyn have the most AirBnb traffic, and the most expensive listings at that. These pricey homes are, for the most part, spread out rather than clustered in the same neighborhood. You can also see that the majority of these top 40 most expensive AirBnbs are for the entire apartment. Only 6 of the 40 are renting out just a private room. Those must be some nice private rooms!

Overall, we can conclude that there are many things to consider when choosing a NYC AirBnb to make sure that you are getting the best deal for your money. Price is associated with room type and borough, and probably with specific neighborhoods in each borough. Therefore, in order to stay within budget, most people will have to make trade-offs when choosing where to stay in New York. This could look like staying in Queens to avoid the prices of Manhattan, or choosing a private room in a shared home to avoid paying a premium for privacy. Whatever you choose to do, make sure to decide what is most important to you, and try to stay within reason.

Exploratory Analysis:

Abstract

The following is an analysis covering the data that we used to support the claims made in our blog post, and how we manipulated the data as we worked with it. Our blog post summarized our findings from the dataset we used, containing thousands of AirBnb listings from 2019, and specific information on each.

Overview and Motivation

We decided to do a data analysis focusing on factors that affect the price of NYC AirBnb listings. Tourism in NYC is extremely popular, so we wanted to look into what drives the cost of these listings. This is an interesting location to analyze: New York is a large tourist destination, hosting sports games, Broadway shows, and much more, but it is also home to over 8 million people, which makes it a very popular city to travel to for work-related reasons. As AirBnb disrupts the hotel industry, we felt that it would be interesting to look into the factors that allow listings to carry such a high price tag.

Initial Questions

We wanted to find out which factors affect AirBnb listing prices, and which do not. At the beginning of our analysis, we were not sure if we would focus on pricing or popularity of listings, but as shown below in the Exploratory Data Analysis section, we ended up working mostly with prices. We wanted to form questions that would be meaningful to our audience, and could potentially help them choose an AirBnb listing down the road.

Our initial questions were:

We go into more detail about how we went about these questions, and why we did not answer all of them in the sections below.

Data

At first, we pulled data using two datasets: AB_NYC_2019 (https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) and Airbnb_NYC (https://www.kaggle.com/sarthakniwate13/air-bnb-nyc-data). After making a few plots, we realized that the data was pretty much the same in both, which we were happy with; that meant that they probably reflected similar listings and the data was more likely to be accurate. Since one dataset alone had almost 50,000 observations, we felt that there was no need for both. We ultimately chose AB_NYC_2019 to be our sole dataset because it had more observations, and it included the listing descriptions, which we felt could be useful as we went deeper into our analysis. Also, AB_NYC_2019 clearly specifies that its data is from 2019 listings, whereas Airbnb_NYC does not specify a timeframe.

library(tidyverse)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(data.table)
library(leaflet)
library(kableExtra)
AB_NYC_2019<-read.csv("/Users/carolynradle/Documents/Math 488P/2_eda/data/AB_NYC_2019.csv")
colnames(AB_NYC_2019)
##  [1] "id"                             "name"                          
##  [3] "host_id"                        "host_name"                     
##  [5] "neighbourhood_group"            "neighbourhood"                 
##  [7] "latitude"                       "longitude"                     
##  [9] "room_type"                      "price"                         
## [11] "minimum_nights"                 "number_of_reviews"             
## [13] "last_review"                    "reviews_per_month"             
## [15] "calculated_host_listings_count" "availability_365"
AB_NYC_2019 <- rename(AB_NYC_2019, borough = neighbourhood_group)

The dataset has 16 columns. We wanted to utilize the majority of them to our benefit, so that we could really analyze which factors had the biggest effect on price. We changed the name of the “neighbourhood_group” column to be called “borough,” for less confusion between that and the “neighbourhood” column.

In the beginning stages of our analysis, we thought that it would be interesting to look at price vs the availability per year. We went ahead and took a look at how the availability per year column looked:

head(AB_NYC_2019$availability_365, 10)
##  [1] 365 355 365 194   0 129   0 220   0 188

In just the first 10 rows, we noticed that the data was a little suspicious, because some listings had 365-day availability, and some had 0. The numbers in the 100s and 200s looked more normal to us. We thought about the many reasons listings could be available or unavailable: maybe the host or the host's family is staying in the apartment for the time being, maybe the host is not actively booking out the listing, or maybe the listing is simply very popular or unpopular based on how it is perceived on the app. Either way, we cannot tell why the availability numbers are the way they are, and there were more 0s and values in the high 300s than we were comfortable with. Removing those rows was also an option, but some of those apartments may genuinely have been very popular or unpopular, and we did not want to skew the data, so we decided to abandon this column altogether.
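To make the concern about extreme availability values concrete, here is a minimal sketch that measures how common they are. It uses only the ten values printed above; the 350-day cutoff for "nearly always available" is an assumption chosen for illustration, not a threshold we used in our analysis.

```r
# First ten availability_365 values, copied from the head() output above.
availability <- c(365, 355, 365, 194, 0, 129, 0, 220, 0, 188)

# Share of listings with no available days at all.
share_zero <- mean(availability == 0)

# Share of listings available nearly all year. The 350-day cutoff is an
# arbitrary assumption for illustration only.
share_near_full <- mean(availability >= 350)

print(c(zero = share_zero, near_full = share_near_full))
```

Swapping the toy vector for AB_NYC_2019$availability_365 would give the same two shares for the full dataset.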

The “name” column, which is the listing descriptions as mentioned before, really caught our eye.

head(AB_NYC_2019$name, 10)
##  [1] "Clean & quiet apt home by the park"              
##  [2] "Skylit Midtown Castle"                           
##  [3] "THE VILLAGE OF HARLEM....NEW YORK !"             
##  [4] "Cozy Entire Floor of Brownstone"                 
##  [5] "Entire Apt: Spacious Studio/Loft by central park"
##  [6] "Large Cozy 1 BR Apartment In Midtown East"       
##  [7] "BlissArtsSpace!"                                 
##  [8] "Large Furnished Room Near B'way "                
##  [9] "Cozy Clean Guest Room - Family Apt"              
## [10] "Cute & Cozy Lower East Side 1 bdrm"

Going through the rows of this column, we noticed that most hosts used adjectives that could not be measured to describe their home. We liked the words “cozy” and “luxury,” because although both are meant to be positive adjectives, they could describe very different listings. We wanted to see if these words correlated to different prices for their respective listings.

Exploratory Data Analysis

When we began our analysis, we were not completely sure whether we would compare most of our factors to price or to the popularity of listings. The data contains columns such as “number_of_reviews” and “reviews_per_month,” which could serve as measures of a listing's popularity in the absence of the availability column mentioned earlier. We also used the number of listings itself in our analysis, because it showed us where the popular AirBnb areas are. This was one of the first plots we made:

AB_NYC_2019 %>%
    group_by(neighbourhood) %>%
    summarize(num_listings = n(), 
              borough = unique(borough)) %>%
    top_n(n = 8, wt = num_listings) %>%
    ggplot(aes(x = borough, 
               y = num_listings, fill = neighbourhood)) +
    geom_col() +
    labs(title = "Most Common Neighborhoods",
         x = "Borough" , y = "Number of Listings")

This graphic gave us a good idea of the most common neighborhoods for NYC travel. We then decided to see if the most common neighborhoods were similar to the most expensive neighborhoods.

pop_neighborhoods <- AB_NYC_2019 %>%
  group_by(neighbourhood) %>%
  summarize(num_listings= n(),
            borough = unique(borough),
            median_price = median(price)
            ) %>%
  arrange(desc(median_price))
head(pop_neighborhoods, 15)
## # A tibble: 15 x 4
##    neighbourhood      num_listings borough       median_price
##    <chr>                     <int> <chr>                <dbl>
##  1 Fort Wadsworth                1 Staten Island         800 
##  2 Woodrow                       1 Staten Island         700 
##  3 Tribeca                     177 Manhattan             295 
##  4 Neponsit                      3 Queens                274 
##  5 NoHo                         78 Manhattan             250 
##  6 Willowbrook                   1 Staten Island         249 
##  7 Flatiron District            80 Manhattan             225 
##  8 Midtown                    1545 Manhattan             210 
##  9 Financial District          744 Manhattan             200 
## 10 West Village                768 Manhattan             200 
## 11 Chelsea                    1113 Manhattan             199 
## 12 SoHo                        358 Manhattan             199 
## 13 Greenwich Village           392 Manhattan             198.
## 14 Battery Park City            70 Manhattan             195 
## 15 Breezy Point                  3 Queens                195

Once we came to the conclusion that the most expensive neighborhoods are not among the most popular, we decided to focus on the factors that correlate with higher listing prices, rather than merely popularity, because price is the largest consideration for prospective travelers. To make this data meaningful, we wanted a focus that resonates with most other travelers.

We took out the neighborhoods in our pop_neighborhoods dataset that had fewer than 5 listings, and we graphed the top 10 most expensive remaining neighborhoods by median price, which all happened to be in Manhattan.

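# Keep neighborhoods with at least 5 listings, then take the 10 highest medians
# (pop_neighborhoods is already sorted by descending median_price)
most_exp_neighborhoods <- pop_neighborhoods %>%
  filter(num_listings >= 5) %>%
  head(10)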

most_exp_neighborhoods %>%
  ggplot() + geom_col(aes(x = neighbourhood, y = median_price)) + coord_flip() + labs(x = "Neighborhood" , y = "Median Price")

We chose to focus on the median price rather than the mean price, because we did not want our summaries to be skewed by the many outliers that we know are present. We instead looked at those outliers directly in our map of NYC, which shows the 40 most expensive AirBnbs, ranging from $3,600 to $8,000.
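The effect of outliers on the mean can be seen in a small sketch. The prices below are hypothetical toy values, not rows from AB_NYC_2019, and the 1.5 × IQR rule is the standard boxplot convention rather than something specific to our analysis.

```r
# Toy nightly prices (hypothetical): six ordinary listings and one extreme one.
prices <- c(50, 100, 150, 200, 250, 600, 8000)

mean_price   <- mean(prices)    # pulled far upward by the 8000 listing
median_price <- median(prices)  # the middle value, unaffected by how extreme the top is

# Boxplot-style outlier rule: values more than 1.5 * IQR above the 75th percentile.
q <- quantile(prices, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
outliers <- prices[prices > upper_fence]

print(c(mean = mean_price, median = median_price, upper_fence = upper_fence))
print(outliers)
```

In this toy example, a single extreme listing moves the mean well above every ordinary price, while the median stays at the middle listing.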

We chose to filter out the listings with a minimum stay of 5 days or more, because a stay that long is not too popular with NYC travelers, so it wouldn’t make sense to compare that to a listing with a minimum stay of 1 or 2 days.

top_listings <- AB_NYC_2019 %>% filter(minimum_nights <5) %>% top_n(n = 40, wt = price) %>% arrange(desc(price))
head(top_listings, 10)
##          id                                      name   host_id host_name
## 1   2953058                             Film Location   1177497   Jessica
## 2  22779726 East 72nd Townhouse by (Hidden by Airbnb) 156158778     Sally
## 3  33007610       70' Luxury MotorYacht on the Hudson   7407743      Jack
## 4  34895693                      Gem of east Flatbush 262534951    Sandra
## 5  33998396          3000 sq ft daylight photo studio   3750764     Kevin
## 6   2271504          SUPER BOWL Brooklyn Duplex Apt!!  11598359  Jonathan
## 7  22780103 Park Avenue Mansion by (Hidden by Airbnb) 156158778     Sally
## 8  12520066        Luxury townhouse Greenwich Village  66240032     Linda
## 9   2243699       SuperBowl Penthouse Loft 3,000 sqft   1483320      Omri
## 10  1448703       Beautiful 1 Bedroom in Nolita/Soho     213266   Jessica
##      borough     neighbourhood latitude longitude       room_type price
## 1   Brooklyn      Clinton Hill 40.69137 -73.96723 Entire home/apt  8000
## 2  Manhattan   Upper East Side 40.76824 -73.95989 Entire home/apt  7703
## 3  Manhattan Battery Park City 40.71162 -74.01693 Entire home/apt  7500
## 4   Brooklyn     East Flatbush 40.65724 -73.92450    Private room  7500
## 5  Manhattan           Chelsea 40.75060 -74.00388 Entire home/apt  6800
## 6   Brooklyn      Clinton Hill 40.68766 -73.96439 Entire home/apt  6500
## 7  Manhattan   Upper East Side 40.78517 -73.95270 Entire home/apt  6419
## 8  Manhattan Greenwich Village 40.73046 -73.99562 Entire home/apt  6000
## 9  Manhattan      Little Italy 40.71895 -73.99793 Entire home/apt  5250
## 10 Manhattan            Nolita 40.72193 -73.99379 Entire home/apt  5000
##    minimum_nights number_of_reviews last_review reviews_per_month
## 1               1                 1  2016-09-15              0.03
## 2               1                 0                            NA
## 3               1                 0                            NA
## 4               1                 8  2019-07-07              6.15
## 5               1                 0                            NA
## 6               1                 0                            NA
## 7               1                 0                            NA
## 8               1                 0                            NA
## 9               1                 0                            NA
## 10              1                 2  2013-09-28              0.03
##    calculated_host_listings_count availability_365
## 1                              11              365
## 2                              12              146
## 3                               1              364
## 4                               2              179
## 5                               6              364
## 6                               1                0
## 7                              12               45
## 8                               1                0
## 9                               1                0
## 10                              1              365
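One detail worth knowing about top_n(): it keeps ties, so asking for 40 rows can return more than 40 when several listings share the cutoff price. The sketch below uses a hypothetical four-row data frame to show the difference from slice_max() with with_ties = FALSE; we have not checked whether ties occur at the cutoff in our own top 40.

```r
library(dplyr)

# Toy listings (hypothetical): two listings tie at the second-highest price.
toy <- data.frame(id = 1:4, price = c(100, 200, 200, 300))

# top_n() keeps every row tied with the n-th largest value.
with_ties <- toy %>% top_n(n = 2, wt = price)

# slice_max() with with_ties = FALSE returns exactly n rows.
without_ties <- toy %>% slice_max(price, n = 2, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(without_ties))
```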

In our blog post, we included a map of where these top 40 listings are located. To do this, we used the longitude and latitude coordinates given in the dataset, and the R package “leaflet.” We felt that it was extremely important to add this visual element, to show that the most expensive listings are not all crowded in the same place, or even in the same borough. Navy blue points represent entire homes, and grey points represent a private room in a home.

# The domain values must match room_type exactly ("Private room", lowercase r);
# otherwise those points fall outside the color scale and are treated as NA
pal <- colorFactor(c("navy", "grey"),
                   domain = c("Entire home/apt", "Private room"))

# addTiles() expects a tile URL template rather than a title,
# so only the CartoDB provider tiles are added as the base map
leaflet(top_listings) %>%
  setView(-74.00, 40.71, zoom = 12) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(
    radius = 10,
    color = ~pal(room_type),
    stroke = TRUE
)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively

Due to the variability in this map and in our data, we cannot draw a conclusion about the cause of these high prices. We can guess that the price of these listings has more to do with the spaces themselves than with the area of the city they are in, because apart from a few close together in Lower Manhattan, the top 40 listings seem spread out.

We then wanted to create a plot that summarized all of the AirBnb prices in a simple, concise way. After developing several different types of plots, we decided that the boxplot would fit our data the best, since we had a lot of outliers that needed to be accounted for. We first separated our plot by price and room type:

AB_NYC_2019 %>%
  group_by(room_type) %>%
  summarize(
     N = n(),
    avg_price = mean(price),
    med_price = median(price)
  )
## # A tibble: 3 x 4
##   room_type           N avg_price med_price
##   <chr>           <int>     <dbl>     <dbl>
## 1 Entire home/apt 25409     212.        160
## 2 Private room    22326      89.8        70
## 3 Shared room      1160      70.1        45
AB_NYC_2019 %>%
  ggplot() + geom_boxplot(mapping = aes(x = price, y = room_type)) +
  coord_cartesian(xlim = c(0, 500)) + labs(y = "Room Type", x = "Price")

We created an x-axis limit of 500 because it is well above the mean of each room type, but with 500 as the limit you can still get a good view of the outliers. This graph was a good starting point, but we needed to separate it further. We already know that the boroughs play a big role in the different prices, so we let color map to the room type, and we changed the y-axis to represent each borough.

AB_NYC_2019 %>%
  ggplot() + geom_boxplot(aes(x = price, y = borough, color = room_type)) + 
  coord_cartesian(xlim =c(0, 500)) + labs(x = "Price", y = "Borough", title = "Price by Borough")
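A note on the axis limit: the boxplots above use coord_cartesian(), which only zooms the view, so the quartiles are still computed from every listing. Setting the limit on the scale instead (for example with xlim() or ylim()) drops out-of-range prices before the statistics are computed. The sketch below illustrates the difference on hypothetical toy prices; it draws the boxes vertically for simplicity, so it uses ylim where our plots use xlim.

```r
library(ggplot2)

# Toy prices (hypothetical, not from AB_NYC_2019): most are modest, two are extreme.
toy <- data.frame(group = "a", price = c(50, 100, 150, 200, 250, 600, 8000))

# Zoom only: the box is computed from all seven prices.
p_zoom <- ggplot(toy) + geom_boxplot(aes(x = group, y = price)) +
  coord_cartesian(ylim = c(0, 500))

# Scale limit: prices above 500 become NA and are dropped before the box is computed.
p_clip <- ggplot(toy) + geom_boxplot(aes(x = group, y = price)) +
  ylim(0, 500)

# Extract the median line that each plot would draw.
median_zoom <- ggplot_build(p_zoom)$data[[1]]$middle
median_clip <- suppressWarnings(ggplot_build(p_clip)$data[[1]]$middle)

print(c(zoom = median_zoom, clip = median_clip))
```

The two plots look alike at a glance, but the clipped version reports a lower median because the two expensive listings were removed before the summary was computed.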

AB_NYC_2019 %>%
  group_by(borough, room_type) %>%
  summarize(
    N = n(),
    avg_price = mean(price)
  )
## `summarise()` has grouped output by 'borough'. You can override using the `.groups` argument.
## # A tibble: 15 x 4
## # Groups:   borough [5]
##    borough       room_type           N avg_price
##    <chr>         <chr>           <int>     <dbl>
##  1 Bronx         Entire home/apt   379     128. 
##  2 Bronx         Private room      652      66.8
##  3 Bronx         Shared room        60      59.8
##  4 Brooklyn      Entire home/apt  9559     178. 
##  5 Brooklyn      Private room    10132      76.5
##  6 Brooklyn      Shared room       413      50.5
##  7 Manhattan     Entire home/apt 13199     249. 
##  8 Manhattan     Private room     7982     117. 
##  9 Manhattan     Shared room       480      89.0
## 10 Queens        Entire home/apt  2096     147. 
## 11 Queens        Private room     3372      71.8
## 12 Queens        Shared room       198      69.0
## 13 Staten Island Entire home/apt   176     174. 
## 14 Staten Island Private room      188      62.3
## 15 Staten Island Shared room         9      57.4

Lastly, we used the “name” column of the dataset to pick out listings whose descriptions contain certain words that we felt might correlate with price. We also restricted the data to Manhattan to try to control bias as much as possible. In all honesty, this part of our analysis is not especially useful for helping readers find cheaper listings, because we do not believe that these words have a direct impact on the price. However, we think that it is interesting that cheaper-than-average listings tend to contain the word “Cozy,” while more-expensive-than-average listings tend to contain the word “Luxury”:

cozy_rooms <- AB_NYC_2019 %>% 
  filter(borough == "Manhattan" , grepl( "Cozy | cozy" , name)) 

x <- cozy_rooms %>%
  group_by(price) %>%
  summarise(n=n()) %>%
  ggplot() + geom_line(aes(x = price, y = n)) + xlim(0,500) + labs(title = "Price of 'Cozy' airbnbs in Manhattan")


totals <- AB_NYC_2019 %>% 
  filter(borough == "Manhattan")

y <- totals  %>%
  group_by(price) %>%
  summarise(n=n()) %>%
  ggplot() + geom_line(aes(x = price, y = n))+ xlim(0,500) + labs(title = "Price of all airbnbs in Manhattan")


luxury_rooms <- AB_NYC_2019 %>% 
  filter(borough == "Manhattan" , grepl( "Luxury | luxury" , name))

z <- luxury_rooms %>%
  group_by(price) %>%
  summarise(n=n()) %>%
  ggplot() + geom_line(aes(x = price, y = n)) + 
  xlim(0,500) + labs(title = "Price of 'Luxury' airbnbs in Manhattan")

grid.arrange(x,y,z)
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 179 row(s) containing missing values (geom_path).
## Warning: Removed 56 row(s) containing missing values (geom_path).
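One caveat about the patterns above: "Cozy | cozy" contains literal spaces, so it matches "Cozy" only when followed by a space and "cozy" only when preceded by one. The sketch below compares it with a case-insensitive pattern. The first two titles come from the head() output earlier in this report; the other three are hypothetical edge cases. The relaxed pattern is an assumed alternative that we did not use, and it would change the counts reported in this analysis.

```r
# Example titles: the first two appear in the dataset preview above;
# the last three are hypothetical and exist only to show edge cases.
titles <- c("Cozy Entire Floor of Brownstone",
            "Large Cozy 1 BR Apartment In Midtown East",
            "Bright and Cozy",
            "COZY studio",
            "Cozy, quiet room")

original <- grepl("Cozy | cozy", titles)               # pattern used in this report
relaxed  <- grepl("cozy", titles, ignore.case = TRUE)  # assumed alternative

print(data.frame(titles, original, relaxed))
```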

When we made these graphs, we felt that they did a great job of showing that “Luxury” AirBnbs tend to be more expensive than average, and that there are far fewer of them overall. However, the graphs for all AirBnbs and for “Cozy” AirBnbs look very similar, even though we know from our computations that their prices are actually quite different. This is why, in our blog post, we stuck to the mean and median prices rather than the graphs:

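# Summary row for Manhattan listings whose name contains "Cozy"
# (named cozy_summary rather than c, to avoid masking base R's c())
cozy_summary <- cozy_rooms %>%
  summarize(
    count = n(),
    Word = "Cozy",
    mean(price),
    median(price)
  )

# Summary row for Manhattan listings whose name contains "Luxury"
luxury_summary <- luxury_rooms %>%
  summarize(
    count = n(),
    Word = "Luxury",
    mean(price),
    median(price)
  )

# Summary row for all Manhattan listings (Word left blank)
total_summary <- totals %>%
  summarize(
    count = n(),
    Word = " ",
    mean(price),
    median(price)
  )

# Stack the three one-row summaries into a single table
alist <- list(cozy_summary, luxury_summary, total_summary)
rbindlist(alist)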
##    count   Word mean(price) median(price)
## 1:  1857   Cozy    129.5256           109
## 2:   925 Luxury    324.8249           220
## 3: 21661           196.8758           150
blist <- rbindlist(alist)
kbl(blist) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
count  Word    mean(price)  median(price)
 1857  Cozy       129.5256            109
  925  Luxury     324.8249            220
21661             196.8758            150

This is much simpler to read, and it definitely conveys our point, with and without outliers.

Final Analysis

From our data analysis, we have found that room type, neighborhood, and borough have an effect on the price of NYC AirBnb listings. We also found that being in a neighborhood with more listings does not appear to affect the price of AirBnb listings. We were not able to determine whether availability or minimum nights correlate with price; the dataset did not give us enough information to draw those conclusions. We also determined that the prices of AirBnbs containing the word “cozy” in their description seem to be lower than average, while the prices of AirBnbs containing the word “luxury” seem to be higher than average. These word associations are just observations, not a direct cause of the price. Overall, these factors work together to determine the price of an AirBnb listing, so there is no single variable that sets the price. However, these variables may have a large effect on the final price tag for an AirBnb listing.